Introduction to Data

Learning objectives

By the end of the lab, you will be able to …

  • implement basic variable manipulation
  • create useful frequency tables

Code-along 02

Download and open code-along-02.qmd

Packages

Load the standard packages.

library(here)
library(tidyverse)
library(gssr)
library(gssrdoc)


Install and load the summarytools package.

install.packages("summarytools")


library(summarytools)

Load your data & codebook

# Get the data only for the 2024 survey respondents
gss24 <- gss_get_yr(2024)

# Load the codebook
data(gss_dict)

Data Management I

Coding basics

You can use R to do basic math calculations:

1 + 2
[1] 3
2 * 5
[1] 10
(1 + 2) / 2
[1] 1.5

You can create new objects with the assignment operator <-:

x <- 3 * 4
x
[1] 12

You can (and should) make comments in your code:

# R will ignore any text after # for that line

# create vector of primes
primes <- c(2, 3, 5, 7, 11, 13)
primes
[1]  2  3  5  7 11 13

Object names must start with a letter and can only contain letters, numbers, _, and .

i_use_snake_case
otherPeopleUseCamelCase
some.people.use.periods
And_aFew.People_RENOUNCEconvention

Operators in R

Operators in R are symbols directing R to perform various kinds of mathematical, logical, and decision operations. A few of the key ones to know before we get started:

To test equality or inequality:
==, !=, >, >=, <, <=

To indicate “and”, “or”, and “not”:
& | !

To assign values to data objects:
<- -> =

Logical operators

operator   definition
<          is less than?
<=         is less than or equal to?
>          is greater than?
>=         is greater than or equal to?
==         is exactly equal to?
!=         is not equal to?

Logical operators (cont.)

Generally useful in a filter() but will come up in various other places as well…

operator      definition
x & y         is x AND y?
x | y         is x OR y?
is.na(x)      is x NA?
!is.na(x)     is x not NA?
x %in% y      is x in y?
!(x %in% y)   is x not in y?
!x            is not x? (only makes sense if x is TRUE or FALSE)
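
The operators above can be tried out on a small vector before they appear inside filter(). This is a minimal sketch; the vector x and its values are made up for illustration.

```r
# A small numeric vector with one missing value (illustrative data)
x <- c(1, 5, NA, 8)

x > 4              # element-wise comparison; the NA stays NA
x >= 1 & x <= 5    # AND: both conditions must hold
x < 2 | x > 7      # OR: at least one condition must hold
is.na(x)           # TRUE only where the value is missing
!is.na(x)          # the negation of is.na(x)
x %in% c(1, 8)     # membership test; returns FALSE, not NA, for the NA

# Results can be stored like any other object
in_range <- x >= 1 & x <= 5
is_member <- x %in% c(1, 8)
```

Note that comparisons with NA return NA, while %in% returns FALSE. That difference matters when you count or filter rows.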

Tidying Data

Most tasks related to data analysis are not glorious or fancy.

A lot of your time is dedicated to whipping your dataset into the shape that you need to be able to analyze it.


This task goes by different names: “data cleaning,” “data management,” “data manipulation,” “data wrangling,” and “data transformation.”

dplyr package

The dplyr package provides a complete set of functions that help you solve the most common data manipulation challenges such as:

  • filtering observations based on their values
  • extracting observations based on their values or positions
  • sampling observations based on a specific number or fraction of rows
  • sorting observations based on one or several variables
  • selecting variables based on their names or positions
  • renaming variables
  • adding new variables based on existing ones
  • summarizing observations or variables to a single descriptive measure
  • performing any operation by group
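
As a rough guide, each task in the list above maps onto one dplyr verb. The sketch below uses a made-up tibble called toy (not the GSS data); the column names are assumptions chosen for illustration.

```r
library(dplyr)

# Toy data invented for illustration
toy <- tibble(
  id    = 1:6,
  group = c("a", "a", "b", "b", "c", "c"),
  score = c(10, 20, 30, 40, 50, 60)
)

toy |> filter(score > 25)                   # filter rows by value
toy |> slice(1:2)                           # extract rows by position
toy |> slice_sample(n = 3)                  # sample a fixed number of rows
toy |> arrange(desc(score))                 # sort by a variable
toy |> select(id, score)                    # select columns by name
toy |> rename(points = score)               # rename a column
toy |> mutate(score_half = score / 2)       # add a new variable
toy |> summarize(mean_score = mean(score))  # summarize to one value

# Perform an operation by group
by_group <- toy |>
  group_by(group) |>
  summarize(mean_score = mean(score))
by_group
```

None of these calls change toy itself; each returns a new data frame unless you assign the result back.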

function(argument)

Functions are (most often) verbs, followed by what they will be applied to in parentheses:


do_this(to_this)
do_that(to_this, to_that, with_those)

The pipe |>


The pipe operator passes what comes before it into the function that comes after it as the first argument in that function.

sum(1, 2)
[1] 3


1 |> 
  sum(2)
[1] 3
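
The payoff of the pipe shows up when several functions are chained. A small sketch with made-up numbers, comparing nested calls to piped calls:

```r
values <- c(4, 9, 16)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(values)), 1)

# Piped calls are read from top to bottom
piped <- values |>
  sqrt() |>
  mean() |>
  round(1)

identical(nested, piped)
```

Both versions compute the same thing; the piped version lists the steps in the order they happen.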

dplyr style

In data transformation pipelines, always use:

  • a space before |>
  • a line break after |>
  • an indent on the next line of code

We’ll talk about data visualization pipes later…


Heads Up!

|> (native pipe operator) and %>% (magrittr package) behave identically for simple cases.

dplyr grammar

What’s the advantage of dplyr grammar? We can sequence data manipulation!


gss24 |> 
  filter(!is.na(sex))  |>
  group_by(sex) |>
  descr(hrs1,
        stats = "common") |>
  tb() 
# A tibble: 2 × 10
  sex        variable  mean    sd   min   med   max n.valid     n pct.valid
  <dbl+lbl>  <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>     <dbl>
1 1 [male]   sex       41.7  13.7     0    40    89     869  1467      59.2
2 2 [female] sex       37.3  13.7     0    40    89     891  1823      48.9

dplyr basics

dplyr verbs (functions) will allow you to solve the vast majority of your data manipulation challenges. They are organized into four groups based on what they operate on: rows, columns, groups, or tables.


The verbs all have in common:

  1. The first argument is always a data frame.
  2. The subsequent arguments typically describe which columns to operate on using the variable names (without quotes).
  3. The output is always a new data frame.

Rows: filter()

Example tibble

Let’s make a tiny data frame to use as an example:

df <- tibble(x = c(1, 2, 3, 4, 5), y = c("a", "a", "b", "c", "c"))
df
# A tibble: 5 × 2
      x y    
  <dbl> <chr>
1     1 a    
2     2 a    
3     3 b    
4     4 c    
5     5 c    

Heads Up!

A tibble is a modern data frame, often used in the tidyverse and ggplot2 packages.
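
Here is a short sketch of filter() applied to the example tibble. The tibble is recreated so the code runs on its own; the object names big_x and a_or_c are made up for illustration.

```r
library(dplyr)

# Recreate the example tibble from above
df <- tibble(x = c(1, 2, 3, 4, 5), y = c("a", "a", "b", "c", "c"))

# Keep rows where x is greater than 2
big_x <- df |>
  filter(x > 2)

# Conditions separated by commas are combined with AND:
# y is "a" or "c", and x is not 5
a_or_c <- df |>
  filter(y %in% c("a", "c"), x != 5)

big_x
a_or_c
```

filter() keeps the rows where every condition is TRUE and drops the rest, including rows where a condition evaluates to NA.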

Variables

Remember, you can access the variables (i.e., columns) using the $ operator, as shown using the table() function.


The variable names are case sensitive. In this dataset, all variables are lowercase.

table(gss24$fefam)

  1   2   3   4 
167 492 899 604 

492 respondents were coded as 2 on this variable. What does that mean?

Variable types

classes (character, factor, numeric)

DICHOTOMOUS (aka binary) A variable with only two categories.

NOMINAL A variable made up of categories that cannot be ordered according to rank.

ORDINAL A variable made up of ranked categories, but there is no systematic and measurable numeric difference between the categories.

INTERVAL-RATIO A variable with categories that are rank-ordered and expressed in the same units.
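
One possible way to connect these measurement levels to R classes is sketched below. The vectors and their values are made up, and the mapping is a common convention rather than a strict rule.

```r
# Illustrative vectors, one per level of measurement (values are made up)
employed <- c(TRUE, FALSE, TRUE)                  # dichotomous
region   <- factor(c("south", "west", "south"))   # nominal
agree    <- factor(c("low", "high", "medium"),
                   levels = c("low", "medium", "high"),
                   ordered = TRUE)                # ordinal
hours    <- c(40, 35.5, 20)                       # interval-ratio

class(employed)
class(region)
class(agree)
class(hours)

# Ordered factors support rank comparisons
below_high <- agree < "high"
below_high
```

The level order of a factor controls the order of rows in tables, which is why it is set explicitly for the ordinal variable.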

Frequency Distributions

Variable Descriptions

Let’s familiarize ourselves with the premarsx and polviews variables.


In the console, type ?premarsx and hit enter. The Help pane will show you the question text, response options and values.


Now, do the same for polviews.

Table

Run this code to see the frequency table for the premarsx variable. Then, add a line below to also see a table for the polviews variable.

table(gss24$premarsx)

   1    2    3    4 
 357  122  258 1378 


table(gss24$polviews)

   1    2    3    4    5    6    7 
 140  421  368 1148  381  516  186 

Cross-tabs

The table command also lets you create a table with two variables.

# 1st variable is the rows, 2nd variable is the columns.
table(gss24$premarsx, gss24$polviews)
   
      1   2   3   4   5   6   7
  1   8  11  13  78  45 132  52
  2   1  10  10  44  18  25   8
  3   3  26  29  91  41  43  16
  4  91 227 187 488 148 145  38
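
Unlabeled cross-tabs are easy to misread. As a small sketch on made-up data, the dnn argument of table() names the two dimensions, and addmargins() appends the totals:

```r
# Two made-up categorical variables
opinion <- c("agree", "agree", "disagree", "agree", "disagree")
party   <- c("left", "right", "right", "left", "left")

# dnn names the row and column dimensions in the printout
xt <- table(opinion, party, dnn = c("opinion", "party"))
xt

# addmargins() appends row and column totals
addmargins(xt)
```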

Labels

Use haven::as_factor to see the value labels instead of the value numbers. Then, do the same for polviews.

table(as_factor(gss24$premarsx))

                 always wrong           almost always wrong 
                          357                           122 
         wrong only sometimes              not wrong at all 
                          258                          1378 
                        other                           iap 
                            0                          1126 
                   don't know            I don't have a job 
                           50                             0 
                  dk, na, iap                     no answer 
                            0                             6 
                not imputable                       refused 
                            0                             0 
               skipped on web                    uncodeable 
                           12                             0 
not available in this release    not available in this year 
                            0                             0 
                 see codebook 
                            0 

Labels

table(as_factor(gss24$polviews))

            extremely liberal                       liberal 
                          140                           421 
             slightly liberal  moderate, middle of the road 
                          368                          1148 
        slightly conservative                  conservative 
                          381                           516 
       extremely conservative                    don't know 
                          186                            99 
                          iap            I don't have a job 
                            0                             0 
                  dk, na, iap                     no answer 
                            0                            20 
                not imputable                       refused 
                            0                             0 
               skipped on web                    uncodeable 
                           30                             0 
not available in this release    not available in this year 
                            0                             0 
                 see codebook 
                            0 

Better Labels

Let’s clean up the levels for premarsx.

# Get rid of all the ‘missing’ levels (special missing codes become plain NA)
gss24$premarsx <- haven::zap_missing(gss24$premarsx)
# Apply the labels instead of numeric values
gss24$premarsx <- as_factor(gss24$premarsx)
table(gss24$premarsx) 

        always wrong  almost always wrong wrong only sometimes 
                 357                  122                  258 
    not wrong at all                other 
                1378                    0 

Better Labels cont.

Let’s get rid of the empty levels in premarsx.

gss24$premarsx <- droplevels(gss24$premarsx)
table(gss24$premarsx)

        always wrong  almost always wrong wrong only sometimes 
                 357                  122                  258 
    not wrong at all 
                1378 
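
The same behavior can be seen on a tiny made-up factor. This sketch uses only base R and an invented variable called answers:

```r
# A factor that declares a level no observation uses (made-up data)
answers <- factor(c("yes", "no", "yes"),
                  levels = c("yes", "no", "other"))
table(answers)         # "other" appears with a count of 0

# droplevels() removes levels that have no observations
answers_clean <- droplevels(answers)
table(answers_clean)   # "other" is gone
levels(answers_clean)
```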

Manipulating Variables

For polviews, let’s combine categories to ease interpretation. This is easiest when the levels are numeric.

Let’s remind ourselves which values correspond to each label.

table(as_factor(gss24$polviews, levels = "both")) # both shows value and label

             [1] extremely liberal                        [2] liberal 
                               140                                421 
              [3] slightly liberal   [4] moderate, middle of the road 
                               368                               1148 
         [5] slightly conservative                   [6] conservative 
                               381                                516 
        [7] extremely conservative                    [NA] don't know 
                               186                                 99 
                          [NA] iap            [NA] I don't have a job 
                                 0                                  0 
                  [NA] dk, na, iap                     [NA] no answer 
                                 0                                 20 
                [NA] not imputable                       [NA] refused 
                                 0                                  0 
               [NA] skipped on web                    [NA] uncodeable 
                                30                                  0 
[NA] not available in this release    [NA] not available in this year 
                                 0                                  0 
                 [NA] see codebook 
                                 0 

Manipulating Variables

# Save over the dataset with an added variable
gss24 <- gss24 |>
  mutate(
    # Create a new variable by assigning labels based on values of polviews
    pol3cat = case_when(
      polviews >= 1 & polviews <= 3 ~ "Liberal",
      polviews == 4 ~ "Moderate",
      polviews >= 5 & polviews <= 7 ~ "Conservative",
      TRUE ~ NA_character_),  # set everything else to “missing”
    # Convert the variable to a factor with a specified level order
    pol3cat = factor(pol3cat,
                     levels = c("Liberal", "Moderate", "Conservative"))
  )


The pipe can be written as |> or %>%.
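
To see the recoding pattern in isolation, here is a sketch on a made-up 7-point variable. The names toy, score, and score3 are invented for illustration. It relies on case_when() returning NA when no condition matches, so it does not need an explicit catch-all line.

```r
library(dplyr)

# Made-up 7-point scores, including one missing value
toy <- tibble(score = c(1, 3, 4, 6, 7, NA))

toy <- toy |>
  mutate(
    score3 = case_when(
      score <= 3 ~ "Low",
      score == 4 ~ "Middle",
      score >= 5 ~ "High"
      # no condition is TRUE for NA, so the result stays NA
    ),
    # fix the display order of the categories
    score3 = factor(score3, levels = c("Low", "Middle", "High"))
  )

toy
```

Conditions are checked in order, and each row gets the label from the first condition that is TRUE.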

Frequency Table

Always double check your work.


table(gss24$polviews, gss24$pol3cat)
   
    Liberal Moderate Conservative
  1     140        0            0
  2     421        0            0
  3     368        0            0
  4       0     1148            0
  5       0        0          381
  6       0        0          516
  7       0        0          186
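
By default table() leaves out missing values, so the check above does not show where the NAs went. A sketch on made-up vectors, using the useNA argument to include them:

```r
# Made-up original and recoded versions of the same variable
original <- c(1, 2, 4, 7, NA, NA)
recoded  <- factor(c("Low", "Low", "Middle", "High", NA, NA),
                   levels = c("Low", "Middle", "High"))

# useNA = "ifany" adds an NA row and column when missing values exist
check <- table(original, recoded, useNA = "ifany")
check
```

If the recode worked, every missing value in the original lands in the NA column of the new variable and nowhere else.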

Relative Frequency Table

Make a frequency table. One of summarytools’ main purposes is to help clean and prepare data for further analysis. Pay attention to the missing values. Then, do the same for premarsx.


freq(gss24$pol3cat) 
Frequencies  
gss24$pol3cat  
Type: Factor  

                     Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
------------------ ------ --------- -------------- --------- --------------
           Liberal    929     29.40          29.40     28.07          28.07
          Moderate   1148     36.33          65.73     34.69          62.77
      Conservative   1083     34.27         100.00     32.73          95.50
              <NA>    149                               4.50         100.00
             Total   3309    100.00         100.00    100.00         100.00
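
The percentage columns can be reproduced by hand. This sketch uses base R with the counts copied from the freq() output above. It shows the arithmetic only and is not how summarytools computes the table internally.

```r
# Counts copied from the freq() output for pol3cat above
counts    <- c(Liberal = 929, Moderate = 1148, Conservative = 1083)
n_missing <- 149

# % Valid: share of the non-missing responses
pct_valid <- round(100 * counts / sum(counts), 2)

# % Total: share of all respondents, missing included
pct_total <- round(100 * counts / (sum(counts) + n_missing), 2)

pct_valid
pct_total

# % Valid Cum.: running total of the valid percentages
cum_valid <- round(100 * cumsum(counts) / sum(counts), 2)
cum_valid
```

The two percentage columns differ only in the denominator: 3160 valid responses versus 3309 respondents in total.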

Relative Frequency Table

freq(gss24$premarsx) 
Frequencies  
gss24$premarsx  
Type: Factor  

                             Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
-------------------------- ------ --------- -------------- --------- --------------
              always wrong    357     16.88          16.88     10.79          10.79
       almost always wrong    122      5.77          22.65      3.69          14.48
      wrong only sometimes    258     12.20          34.85      7.80          22.27
          not wrong at all   1378     65.15         100.00     41.64          63.92
                      <NA>   1194                              36.08         100.00
                     Total   3309    100.00         100.00    100.00         100.00

Pretty Tables

Using report.nas = FALSE suppresses the missing data.
The headings = FALSE parameter suppresses the heading section. Do the same for premarsx.


freq(gss24$pol3cat, report.nas = FALSE, headings = FALSE) 

                     Freq        %   % Cum.
------------------ ------ -------- --------
           Liberal    929    29.40    29.40
          Moderate   1148    36.33    65.73
      Conservative   1083    34.27   100.00
             Total   3160   100.00   100.00

Pretty Tables

freq(gss24$premarsx, report.nas = FALSE, headings = FALSE) 

                             Freq        %   % Cum.
-------------------------- ------ -------- --------
              always wrong    357    16.88    16.88
       almost always wrong    122     5.77    22.65
      wrong only sometimes    258    12.20    34.85
          not wrong at all   1378    65.15   100.00
                     Total   2115   100.00   100.00

Cross-tab

The table() function gives us the frequencies.


table(gss24$premarsx, gss24$pol3cat)
                      
                       Liberal Moderate Conservative
  always wrong              32       78          229
  almost always wrong       21       44           51
  wrong only sometimes      58       91          100
  not wrong at all         505      488          331


We want to add the column percentages…

Relative Frequency Cross-tab

ctable(gss24$premarsx, gss24$pol3cat,  # change from table() to ctable()
       prop = "c",                     # “c” gives column %; “r” would give row %
       format = "p",                   # this adds the % symbols to the table
       useNA = "no")                   # exclude the missing levels from the table
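
To see what column percentages mean, here is a base R sketch using the counts copied from the table() cross-tab above. It uses prop.table() as a stand-in and is not a description of how ctable() works internally.

```r
# Counts copied from the table() cross-tab above
xt <- matrix(
  c( 32,  78, 229,
     21,  44,  51,
     58,  91, 100,
    505, 488, 331),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    premarsx = c("always wrong", "almost always wrong",
                 "wrong only sometimes", "not wrong at all"),
    pol3cat  = c("Liberal", "Moderate", "Conservative")
  )
)

# margin = 2 divides each cell by its column total
col_prop <- prop.table(xt, margin = 2)
col_pct  <- round(100 * col_prop, 1)
col_pct
```

Each column now sums to about 100, so you compare across a row: for example, the share answering "always wrong" among liberals versus conservatives.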